Abstract. In this paper, we consider discrete-time infinite
horizon problems of optimal control to a terminal set of states. 
These are the problems that are often taken as the starting 
point for adaptive dynamic programming. Under very general 
assumptions, we establish the uniqueness of solution of Bellman’s 
equation, and we provide convergence results for value and policy 
iteration. 


I. Introduction 

In this paper we consider a deterministic discrete-time optimal
control problem involving the system

$$x_{k+1} = f(x_k, u_k), \qquad k = 0, 1, \ldots, \tag{1}$$

where $x_k$ and $u_k$ are the state and control at stage $k$, lying in
sets $X$ and $U$, respectively, and $f$ is a function mapping $X \times U$
to $X$. The control $u_k$ must be chosen from a constraint set
$U(x_k) \subset U$ that may depend on the current state $x_k$. The cost
for the $k$th stage, denoted $g(x_k, u_k)$, is assumed nonnegative
and may possibly take the value $\infty$:

$$0 \le g(x_k, u_k) \le \infty, \qquad x_k \in X,\ u_k \in U(x_k), \tag{2}$$

[values $g(x_k, u_k) = \infty$ may be used to model constraints
on $x_k$, for example]. We are interested in feedback policies
of the form $\pi = \{\mu_0, \mu_1, \ldots\}$, where each $\mu_k$ is a function
mapping every $x \in X$ into the control $\mu_k(x) \in U(x)$. The
set of all policies is denoted by $\Pi$. Policies of the form $\pi =
\{\mu, \mu, \ldots\}$ are called stationary, and for convenience, when
confusion cannot arise, will be denoted by $\mu$. No restrictions
are placed on $X$ and $U$: for example, they may be finite sets as
in classical shortest path problems involving a graph, or they
may be continuous spaces as in classical problems of control
to the origin or some other terminal set.

Given an initial state $x_0$, a policy $\pi = \{\mu_0, \mu_1, \ldots\}$, when
applied to the system (1), generates a unique sequence of state
control pairs $(x_k, \mu_k(x_k))$, $k = 0, 1, \ldots$, with cost

$$J_\pi(x_0) = \lim_{k \to \infty} \sum_{t=0}^{k} g(x_t, \mu_t(x_t)), \qquad x_0 \in X, \tag{3}$$

[the limit exists thanks to the nonnegativity assumption (2)].
We view $J_\pi$ as a function over $X$ that takes values in $[0, \infty]$.
We refer to it as the cost function of $\pi$. For a stationary
policy $\mu$, the corresponding cost function is denoted by $J_\mu$.
The optimal cost function is defined as

$$J^*(x) = \inf_{\pi \in \Pi} J_\pi(x), \qquad x \in X,$$

and a policy $\pi^*$ is said to be optimal if it attains the minimum
of $J_\pi(x)$ for all $x \in X$, i.e.,

$$J_{\pi^*}(x) = \inf_{\pi \in \Pi} J_\pi(x) = J^*(x), \qquad \forall\ x \in X.$$


In the context of dynamic programming (DP for short), 
one hopes to prove that the optimal cost function J* satisfies 
Bellman’s equation: 


$$J^*(x) = \inf_{u \in U(x)} \big\{ g(x, u) + J^*(f(x, u)) \big\}, \qquad \forall\ x \in X, \tag{4}$$

and that an optimal stationary policy may be obtained through 
the minimization in the right side of this equation. Note that 
Bellman’s equation generically has multiple solutions, since 
adding a positive constant to any solution produces another 
solution. A classical result, stated in Prop. 4(a) of Section II, 
is that the optimal cost function J* is the “smallest” solution 
of Bellman’s equation. In this paper we will focus on deriving 
conditions under which J* is the unique solution within a 
certain restricted class of functions. 

In this paper, we will also consider finding J* with the 
classical algorithms of value iteration (VI for short) and policy 
iteration (PI for short). The VI algorithm starts from some 
nonnegative function $J_0 : X \mapsto [0, \infty]$, and generates a
sequence of functions $\{J_k\}$ according to

$$J_{k+1}(x) = \inf_{u \in U(x)} \big\{ g(x, u) + J_k(f(x, u)) \big\}, \qquad x \in X. \tag{5}$$

We will derive conditions under which $J_k$ converges to $J^*$
pointwise.
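The VI recursion (5) can be sketched in a few lines for the special case where $X$ and $U$ are finite. The three-state graph below, and the names `TRANSITIONS`, `vi_step`, and `value_iteration`, are hypothetical illustrations and are not taken from the paper; state `"t"` plays the role of a cost-free and absorbing state.

```python
# Hypothetical deterministic shortest path problem (illustration only).
# TRANSITIONS[x][u] = (next_state, stage_cost); "t" is cost-free and absorbing.
TRANSITIONS = {
    "a": {"to_b": ("b", 1.0), "to_t": ("t", 5.0)},
    "b": {"to_t": ("t", 2.0), "to_a": ("a", 1.0)},
    "t": {"stay": ("t", 0.0)},
}


def vi_step(J):
    """One application of Eq. (5): J_{k+1}(x) = min_u { g(x,u) + J_k(f(x,u)) }."""
    return {
        x: min(cost + J[nxt] for (nxt, cost) in controls.values())
        for x, controls in TRANSITIONS.items()
    }


def value_iteration(J0, num_iters):
    """Apply vi_step num_iters times, starting from the function J0."""
    J = dict(J0)
    for _ in range(num_iters):
        J = vi_step(J)
    return J


# Start from the zero function and from a function that overestimates the costs.
J_from_zero = value_iteration({x: 0.0 for x in TRANSITIONS}, 10)
J_from_above = value_iteration({"a": 10.0, "b": 10.0, "t": 0.0}, 10)
print(J_from_zero, J_from_above)
```

In this small instance both starting functions lead to the same limit, which is the shortest distance to `"t"` from each state.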

The PI algorithm starts from a stationary policy $\mu^0$, and generates
a sequence of stationary policies $\{\mu^k\}$ via a sequence
of policy evaluations to obtain $J_{\mu^k}$ from the equation

$$J_{\mu^k}(x) = g(x, \mu^k(x)) + J_{\mu^k}\big(f(x, \mu^k(x))\big), \qquad x \in X, \tag{6}$$

interleaved with policy improvements to obtain $\mu^{k+1}$ from $J_{\mu^k}$
according to

$$\mu^{k+1}(x) \in \arg\min_{u \in U(x)} \big\{ g(x, u) + J_{\mu^k}(f(x, u)) \big\}, \qquad x \in X. \tag{7}$$

We implicitly assume here that $J_{\mu^k}$ satisfies Eq. (6), which
is true under the cost nonnegativity assumption (2) (cf. Prop.
4 in the next section). Also for the PI algorithm to be well-defined,
the minimum in Eq. (7) should be attained for each
$x \in X$, which is true under some conditions that guarantee
compactness of the level sets

$$\big\{ u \in U(x) \mid g(x, u) + J_{\mu^k}(f(x, u)) \le \lambda \big\}, \qquad \lambda \in \Re.$$

We will derive conditions under which $J_{\mu^k}$ converges to $J^*$
pointwise.
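A minimal sketch of the PI iteration (6)-(7) for a finite deterministic problem follows. The graph and all function names are hypothetical and serve only as an illustration. Policy evaluation is done here by following the policy and summing stage costs, with cost $\infty$ assigned when the trajectory is trapped in a cycle of positive cost; this is an assumption that is adequate for finite graphs, not a method stated in the paper.

```python
import math

# Hypothetical deterministic shortest path problem (illustration only).
# TRANSITIONS[x][u] = (next_state, stage_cost); "t" is cost-free and absorbing.
TRANSITIONS = {
    "a": {"to_b": ("b", 1.0), "to_t": ("t", 5.0)},
    "b": {"to_t": ("t", 2.0), "to_a": ("a", 1.0)},
    "t": {"stay": ("t", 0.0)},
}


def policy_cost(policy, start):
    """Policy evaluation, cf. Eq. (6): total cost of following `policy` from `start`."""
    total, x, seen = 0.0, start, {}
    while x not in seen:
        seen[x] = total  # cost accumulated before visiting x
        nxt, cost = TRANSITIONS[x][policy[x]]
        total += cost
        x = nxt
    # x has been visited before, so the trajectory has entered a cycle.
    cycle_cost = total - seen[x]
    return total if cycle_cost == 0.0 else math.inf


def improve(J):
    """Policy improvement, cf. Eq. (7): pick a minimizing control at each state."""
    new_policy = {}
    for x, controls in TRANSITIONS.items():
        new_policy[x] = min(controls, key=lambda u: controls[u][1] + J[controls[u][0]])
    return new_policy


def policy_iteration(policy, max_iters=20):
    """Alternate evaluation and improvement; return final policy, cost, and cost history."""
    history = []
    for _ in range(max_iters):
        J = {x: policy_cost(policy, x) for x in TRANSITIONS}
        history.append(J)
        new_policy = improve(J)
        if new_policy == policy:
            break
        policy = new_policy
    return policy, J, history


final_policy, final_J, history = policy_iteration({"a": "to_t", "b": "to_a", "t": "stay"})
print(final_policy, final_J)
```

In this instance the cost functions of the generated policies decrease monotonically from one iteration to the next.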

In this paper, we will address the preceding questions, for
the case where there is a nonempty stopping set $X_s \subset X$,
which consists of cost-free and absorbing states in the sense
that

$$g(x, u) = 0, \quad x = f(x, u), \qquad \forall\ x \in X_s,\ u \in U(x). \tag{8}$$

Clearly, $J^*(x) = 0$ for all $x \in X_s$, so the set $X_s$ may be
viewed as a desirable set of termination states that we are
trying to reach or approach with minimum total cost. We will
assume in addition that $J^*(x) > 0$ for $x \notin X_s$, so that

$$X_s = \{ x \in X \mid J^*(x) = 0 \}. \tag{9}$$

In the applications of primary interest, $g$ is usually taken to
be strictly positive outside of $X_s$ to encourage asymptotic
convergence of the generated state sequence to $X_s$, so this
assumption is natural and often easily verifiable. Besides $X_s$,
another interesting subset of $X$ is

$$X_f = \{ x \in X \mid J^*(x) < \infty \}.$$

Ordinarily, in practical applications, the states in $X_f$ are
those from which one can reach the stopping set $X_s$, at least
asymptotically.

For an initial state $x$, we say that a policy $\pi$ terminates
starting from $x$ if the state sequence $\{x_k\}$ generated starting
from $x$ and using $\pi$ reaches $X_s$ in finite time, i.e., satisfies
$x_{\bar k} \in X_s$ for some index $\bar k$. A key assumption in this paper is
that the optimal cost $J^*(x)$ (if it is finite) can be approximated
arbitrarily closely by using policies that terminate from $x$. In
particular, in all the results and discussions of the paper we
make the following assumption (except for Prop. 5, which
provides conditions under which the assumption holds).

Assumption 1. The cost nonnegativity condition (2) and
stopping set conditions (8)-(9) hold. Moreover, for every pair
$(x, \epsilon)$ with $x \in X_f$ and $\epsilon > 0$, there exists a policy $\pi$ that
terminates starting from $x$ and satisfies $J_\pi(x) \le J^*(x) + \epsilon$.

Specific and easily verifiable conditions that imply this
assumption will be given in Section IV. A prominent case
is when $X$ and $U$ are finite, so the problem becomes a deterministic
shortest path problem with nonnegative arc lengths. If
all cycles of the state transition graph have positive length, all
policies $\pi$ that do not terminate from a state $x \in X_f$ must
satisfy $J_\pi(x) = \infty$, implying that there exists an optimal
policy that terminates from all $x \in X_f$. Thus, in this case
Assumption 1 is naturally satisfied.

When $X$ is the $n$-dimensional Euclidean space $\Re^n$, a primary
case of interest for this paper, it may easily happen that
the optimal policies are not terminating from some $x \in X_f$,
but instead the optimal state trajectories may approach $X_s$
asymptotically. This is true for example in the classical linear-quadratic
optimal control problem, where $X = \Re^n$, $X_s = \{0\}$,
$U = \Re^m$, $g$ is positive semidefinite quadratic, and $f$ represents
a linear system of the form $x_{k+1} = A x_k + B u_k$, where $A$ and
$B$ are given matrices. However, we will show in Section IV
that Assumption 1 is satisfied under some natural and easily
verifiable conditions.

Regarding notation, we denote by $\Re$ and $\Re^n$ the real line
and $n$-dimensional Euclidean space, respectively. We denote
by $E^+(X)$ the set of all functions $J : X \mapsto [0, \infty]$, and by $\mathcal{J}$
the set of functions

$$\mathcal{J} = \big\{ J \in E^+(X) \mid J(x) = 0,\ \forall\ x \in X_s \big\}. \tag{10}$$

Since $X_s$ consists of cost-free and absorbing states [cf. Eq.
(8)], the set $\mathcal{J}$ contains the cost function of all policies $\pi$,
as well as $J^*$. In our terminology, all equations, inequalities,
and convergence limits involving functions are meant to be
pointwise. Our main results are given in the following three
propositions.

Proposition 1 (Uniqueness of Solution of Bellman’s Equation).
Let Assumption 1 hold. The optimal cost function $J^*$ is
the unique solution of Bellman’s equation (4) within the set
of functions $\mathcal{J}$.

There are well-known examples where $g \ge 0$ but Assumption 1
does not hold, and there are additional solutions of
Bellman’s equation within $\mathcal{J}$. The following is a two-state
shortest path example, which is discussed in more detail in
[12], Section 3.1.2, and [14], Example 1.1.

Example 1 (Counterexample for Uniqueness of Solution of
Bellman’s Equation). Let $X = \{0, 1\}$, where 0 is the unique
cost-free and absorbing state, $X_s = \{0\}$, and assume that at
state 1 we can stay at 1 at no cost, or move to 0 at cost 1.
Here $J^*(0) = J^*(1) = 0$, so Eq. (9) is violated. It can be
seen that

$$\mathcal{J} = \{ J \mid J(0) = 0,\ J(1) \ge 0 \},$$

and that Bellman’s equation is

$$J(0) = J(0), \qquad J(1) = \min\{ J(1),\ 1 + J(0) \}.$$

It can be seen that Bellman’s equation has infinitely many
solutions within $\mathcal{J}$, namely the set $\{ J \mid J(0) = 0,\ 0 \le J(1) \le 1 \}$.
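The claim of Example 1 can be checked directly. The short sketch below evaluates the right side of Bellman's equation for the two-state problem on a few candidate functions with $J(0) = 0$; the function names and the grid of candidate values are hypothetical choices made for illustration.

```python
def bellman_rhs(J):
    """Right side of Bellman's equation in Example 1, with J = (J(0), J(1))."""
    J0, J1 = J
    # State 0 is absorbing at zero cost; at state 1 we stay (cost 0) or move to 0 (cost 1).
    return (J0, min(J1, 1.0 + J0))


def solves_bellman(J):
    """True if J is a fixed point of the Bellman operator of Example 1."""
    return bellman_rhs(J) == J


# Candidates with J(0) = 0 and J(1) in {0, 0.25, ..., 2}.
candidates = [(0.0, c / 4) for c in range(9)]
solutions = [J for J in candidates if solves_bellman(J)]
print(solutions)
```

Every candidate with $0 \le J(1) \le 1$ passes the check, while those with $J(1) > 1$ do not.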

Proposition 2 (Convergence of VI). Let Assumption 1 hold.

(a) The VI sequence $\{J_k\}$ generated by Eq. (5) converges
pointwise to $J^*$ starting from any function $J_0 \in \mathcal{J}$ with
$J_0 \ge J^*$.

(b) Assume further that $U$ is a metric space, and the sets
$U_k(x, \lambda)$ given by

$$U_k(x, \lambda) = \big\{ u \in U(x) \mid g(x, u) + J_k(f(x, u)) \le \lambda \big\},$$

are compact for all $x \in X$, $\lambda \in \Re$, and $k$, where $\{J_k\}$ is
the VI sequence generated by Eq. (5) starting from
$J_0 \equiv 0$. Then the VI sequence $\{J_k\}$ generated by Eq.
(5) converges pointwise to $J^*$ starting from any function
$J_0 \in \mathcal{J}$.

The compactness assumption of Prop. 2(b) is satisfied if
$U(x)$ is finite for all $x \in X$. Other easily verifiable assumptions
implying this compactness assumption will be given
later. Note that when there are solutions to Bellman’s equation
within $\mathcal{J}$, in addition to $J^*$, VI will not converge to $J^*$ starting
from any of these solutions. However, it is also possible that
Bellman’s equation has $J^*$ as its unique solution within $\mathcal{J}$, and
yet VI does not converge to $J^*$ starting from the zero function
because the compactness assumption of Prop. 2(b) is violated.
There are several examples of this type in the literature, and
the following example, an adaptation of Example 4.3.3 of [12],
is a deterministic problem for which Assumption 1 is satisfied.


Example 2 (Counterexample for Convergence of VI). Let
$X = [0, \infty) \cup \{s\}$, with $s$ being a cost-free and absorbing
state, and let $U = (0, \infty) \cup \{\bar u\}$, where $\bar u$ is a special stopping
control, which moves the system from states $x \ge 0$ to state $s$
at unit cost. The system has the form

$$x_{k+1} = \begin{cases} x_k + u_k & \text{if } x_k \ge 0 \text{ and } u_k \ne \bar u, \\ s & \text{if } x_k \ge 0 \text{ and } u_k = \bar u, \\ s & \text{if } x_k = s \text{ and } u_k \in U. \end{cases}$$

The cost per stage has the form

$$g(x_k, u_k) = \begin{cases} x_k & \text{if } x_k \ge 0 \text{ and } u_k \ne \bar u, \\ 1 & \text{if } x_k \ge 0 \text{ and } u_k = \bar u, \\ 0 & \text{if } x_k = s \text{ and } u_k \in U. \end{cases}$$

Let also $X_s = \{s\}$. Then it can be verified that

$$J^*(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{if } x = s, \end{cases}$$

and that an optimal policy is to use the stopping control $\bar u$
at every state (since using any other control at states $x \ge 0$
leads to unbounded accumulation of positive cost). Thus it can
be seen that Assumption 1 is satisfied. On the other hand, the
VI algorithm is

$$J_{k+1}(x) = \min\Big\{ 1 + J_k(s),\ \inf_{u > 0} \big\{ x + J_k(x + u) \big\} \Big\}$$

for $x \ge 0$, and $J_{k+1}(s) = J_k(s)$, and it can be verified by
induction that starting from $J_0 \equiv 0$, the sequence $\{J_k\}$ is
given for all $k$ by

$$J_k(x) = \begin{cases} \min\{1, kx\} & \text{if } x \ge 0, \\ 0 & \text{if } x = s. \end{cases}$$

Thus $J_k(0) = 0$ for all $k$, while $J^*(0) = 1$, so the VI algorithm
fails to converge for the state $x = 0$. The difficulty here is that
the compactness assumption of Prop. 2(b) is violated.
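The induction step of Example 2 can be checked numerically. In the sketch below the infimum over $u > 0$ is approximated by a minimum over a few small positive controls; this approximation, the tolerance, and the function names are assumptions made for illustration, and $J_k(s) = 0$ is used throughout.

```python
def J_closed(k, x):
    """Closed form claimed in Example 2: J_k(x) = min{1, k x} for x >= 0."""
    return min(1.0, k * x)


def vi_update(k, x, controls=(1e-9, 1e-6, 1e-3, 1.0)):
    """Approximate J_{k+1}(x) computed from J_k at a state x >= 0.

    The inf over u > 0 is replaced by a min over the finite tuple `controls`
    (an approximation: the infimum is approached as u tends to 0 from above).
    """
    stop = 1.0  # stopping control: unit cost plus J_k(s) = 0
    move = min(x + J_closed(k, x + u) for u in controls)
    return min(stop, move)


# Compare the VI update with the closed form at a few states and iterations.
max_gap = max(
    abs(vi_update(k, x) - J_closed(k + 1, x))
    for k in range(6)
    for x in (0.0, 0.1, 0.3, 2.0)
)
print(max_gap, J_closed(1000, 0.0), J_closed(1000, 0.01))
```

The iterates stay at 0 at the state $x = 0$ for every $k$, while at any fixed $x > 0$ they reach the value 1 once $k \ge 1/x$.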


Proposition 3 (Convergence of PI). Let Assumption 1 hold.
A sequence $\{J_{\mu^k}\}$ generated by the PI algorithm (6), (7)
satisfies $J_{\mu^k}(x) \downarrow J^*(x)$ for all $x \in X$.

It is implicitly assumed in the preceding proposition that the
PI algorithm is well-defined in the sense that the minimization
in the policy improvement operation (7) can be carried out for
every $x \in X$. Easily verifiable conditions that guarantee this
also guarantee the compactness condition of Prop. 2(b), and
will be noted following Prop. 4 in the next section. Moreover,
in Section IV we will prove a similar convergence result for
a variant of the PI algorithm where the policy evaluation is
carried out approximately through a finite number of VIs.

Example 3 (Counterexample for Convergence of PI). For a
simple example where the PI sequence $J_{\mu^k}$ does not converge
to $J^*$ if Assumption 1 is violated, consider the two-state
shortest path Example 1. Let $\mu$ be the suboptimal policy that
moves from state 1 to state 0. Then $J_\mu(0) = 0$, $J_\mu(1) = 1$, and
it can be seen that $\mu$ satisfies the policy improvement equation

$$\mu(1) \in \arg\min \big\{ 1 + J_\mu(0),\ J_\mu(1) \big\}.$$

Thus PI may stop with the suboptimal policy $\mu$.
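The tie in the policy improvement step of Example 3 can be made explicit with a few lines of code. The dictionary layout and names below are hypothetical; the data are those of the two-state problem.

```python
# At state 1: "stay" keeps the state at cost 0, "move" goes to state 0 at cost 1.
CONTROLS_AT_1 = {"stay": (1, 0.0), "move": (0, 1.0)}

# Cost function of the policy mu that moves from state 1 to state 0.
J_mu = {0: 0.0, 1: 1.0}

# Values g(1, u) + J_mu(f(1, u)) compared in the policy improvement step.
q_values = {u: cost + J_mu[nxt] for u, (nxt, cost) in CONTROLS_AT_1.items()}
best = min(q_values.values())
minimizers = [u for u, q in q_values.items() if q == best]
print(q_values, minimizers)
```

Both controls attain the minimum, so the control used by $\mu$ is a valid outcome of policy improvement even though staying at state 1 forever has cost 0.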

The results of the preceding three propositions are new at 
the level of generality given here. For example, there has been
no proposal of a valid PI algorithm in the classical literature on 
nonnegative cost infinite horizon Markovian decision problems 
(exceptions are special cases such as linear-quadratic problems 
[23]). The ideas of the present paper stem from a more general 
analysis regarding the convergence of VI, which was presented 
recently in the author’s research monograph on abstract DP 
[12], and various extensions given in the recent papers [13],
[14]. Two more papers of the author, coauthored with H. Yu, 
deal with issues that relate in part to the intricacies of the 
convergence of VI and PI in undiscounted infinite horizon DP 
[35], [5]. 

The paper is organized as follows. In Section II we provide 
background and references, which place in context our results 
and methods of analysis in relation to the literature. In Section 
III we give the proofs of Props. 1-3. In Section IV we discuss
special cases and easily verifiable conditions that imply our 
assumptions, and we provide extensions of our analysis. 

II. Background 

The issues discussed in this paper have received attention since
the 60’s, originally in the work of Blackwell [15], who considered
the case $g \le 0$, and the work by Strauch (Blackwell’s PhD
student) [30], who considered the case $g \ge 0$. For textbook
accounts we refer to [2], [25], [11], and for a more abstract
development, we refer to the monograph [12]. These works
showed that the cases where $g \le 0$ (which corresponds to
maximization of nonnegative rewards) and $g \ge 0$ (which is
most relevant to the control problems of this paper) are quite
different in structure. In particular, while VI converges to $J^*$
starting from $J_0 \equiv 0$ when $g \le 0$, this is not so when $g \ge 0$;
a certain compactness condition is needed to guarantee this
[see Example 2 and part (d) of the following proposition].
Moreover when $g \ge 0$, Bellman’s equation may have solutions
$J \ne J^*$ with $J \ge J^*$ (see Example 1), and VI will not
converge to $J^*$ starting from such $J$. In addition it is known
that in general, PI need not converge to $J^*$ and may instead
stop with a suboptimal policy (see Example 3).

The following proposition gives the standard results when
$g \ge 0$ (see [2], Props. 5.2, 5.4, and 5.10, [11], Props.
4.1.1, 4.1.3, 4.1.5, 4.1.9, or [12], Props. 4.3.3, 4.3.9, and
4.3.14). These results hold for stochastic infinite horizon DP
problems with nonnegative cost per stage, and do not take
into account the favorable structure of deterministic problems
or the presence of the stopping set $X_s$.

Proposition 4. Let the nonnegativity condition (2) hold.

(a) $J^*$ satisfies Bellman’s equation (4), and if $J \in E^+(X)$
is another solution, i.e., $J$ satisfies

$$J(x) = \inf_{u \in U(x)} \big\{ g(x, u) + J(f(x, u)) \big\}, \qquad \forall\ x \in X, \tag{11}$$

then $J^* \le J$.

(b) For all stationary policies $\mu$ we have

$$J_\mu(x) = g(x, \mu(x)) + J_\mu\big(f(x, \mu(x))\big), \qquad \forall\ x \in X. \tag{12}$$

(c) A stationary policy $\mu^*$ is optimal if and only if

$$\mu^*(x) \in \arg\min_{u \in U(x)} \big\{ g(x, u) + J^*(f(x, u)) \big\}, \qquad \forall\ x \in X. \tag{13}$$

(d) If $U$ is a metric space and the sets

$$U_k(x, \lambda) = \big\{ u \in U(x) \mid g(x, u) + J_k(f(x, u)) \le \lambda \big\} \tag{14}$$

are compact for all $x \in X$, $\lambda \in \Re$, and $k$, where $\{J_k\}$ is
the sequence generated by VI [cf. Eq. (5)] starting from
$J_0 \equiv 0$, then there exists at least one optimal stationary
policy, and we have $J_k \to J^*$.


Compactness assumptions such as the one of part (d) above
were originally given in [9], [10], and in [29]. They have been
used in several other works, such as [3], [11], Prop. 4.1.9. In
particular, the condition of part (d) holds when $U(x)$ is a finite
set for all $x \in X$. The condition of part (d) also holds when
$X = \Re^n$, and for each $x \in X$, the set

$$\{ u \in U(x) \mid g(x, u) \le \lambda \}$$

is a compact subset of $\Re^m$, for all $\lambda \in \Re$, and $g$ and $f$ are
continuous in $u$. The proof consists of showing by induction
that the VI iterates $J_k$ have compact level sets and hence are
lower semicontinuous.

Let us also note a recent result of H. Yu and the author [35],
where it was shown that $J^*$ is the unique solution of Bellman’s
equation within the class of all functions $J \in E^+(X)$ that
satisfy

$$0 \le J \le c J^* \quad \text{for some } c > 0, \tag{15}$$

(we refer to [35] for discussion and references to antecedents
of this result). Moreover it was shown that VI converges to
$J^*$ starting from any function satisfying the condition

$$J^* \le J \le c J^* \quad \text{for some } c > 0,$$

and under the compactness conditions of Prop. 4(d), starting
from any $J$ that satisfies Eq. (15). The same paper and a related
paper [5] discuss extensively PI algorithms for stochastic
nonnegative cost problems.

For deterministic problems, there has been substantial research
in the adaptive dynamic programming literature, regarding
the validity of Bellman’s equation and the uniqueness of
its solution, as well as the attendant questions of convergence
of VI and PI. In particular, infinite horizon deterministic
optimal control for both discrete-time and continuous-time
systems has been considered since the early days of DP in the
works of Bellman. For continuous-time problems the questions
discussed in the present paper involve substantial technical
difficulties, since the analog of the (discrete-time) Bellman
equation (4) is the steady-state form of the (continuous-time)
Hamilton-Jacobi-Bellman equation, a nonlinear partial
differential equation the solution and analysis of which is
in general very complicated. A formidable difficulty is the
potential lack of differentiability of the optimal cost function,
even for simple problems such as time-optimal control of
second order linear systems to the origin.

The analog of VI for continuous-time systems essentially
involves the time integration of the Hamilton-Jacobi-Bellman
equation, and its analysis must deal with difficult issues of stability
and convergence to a steady-state solution. Nonetheless
there have been proposals of continuous-time PI algorithms,
in the early papers [26], [23], [28], [34], and the thesis [6], as
well as more recently in several works; see e.g., the book [32],
the survey [18], and the references quoted there. These works
also address the possibility of value function approximation,
similar to other approximation-oriented methodologies such
as neurodynamic programming [4] and reinforcement learning
[31], which consider primarily discrete-time systems. For
example, one restriction of the PI method is that it
must be started with a stabilizing controller; see for example
the paper [23], which considered linear-quadratic continuous-time
problems, and showed convergence to the optimal policy
of the PI algorithm, assuming that an initial stabilizing linear
controller is used. By contrast, no such restriction is needed in
the PI methodology of the present paper; questions of stability
are addressed only indirectly through the finiteness of the
values $J^*(x)$ and Assumption 1.

For discrete-time systems there has been much research,
both for VI and PI algorithms. For a selective list of recent
references, which themselves contain extensive lists of other
references, see the book [32], the papers [19], [16], [17], [22],
[33], the survey papers in the edited volumes [27] and [21],
and the special issue [20]. Some of these works relate to
continuous-time problems as well, and in their treatment of
algorithmic convergence, typically assume that $X$ and $U$ are
Euclidean spaces, as well as continuity and other conditions
on $g$, special structure of the system, etc. It is beyond our
scope to provide a detailed survey of the state-of-the-art of
the VI and PI methodology in the context of adaptive DP.
However, it should be clear that the works in this field involve
more restrictive assumptions than our corresponding results
of Props. 1-3. Of course, these works also address questions
that we do not, such as issues of stability of the obtained
controllers, the use of approximations, etc. Thus the results
of the present work may be viewed as new in that they
rely on very general assumptions, yet do not address some
important practical issues. The line of analysis of the present
paper, which is based on general results of Markovian decision
problem theory and abstract forms of dynamic programming,
is also different from the lines of analysis of works in adaptive
DP, which make heavy use of the deterministic character of
the problem and control theoretic methods such as Lyapunov
stability.

Still there is a connection between our line of analysis and
Lyapunov stability. In particular, if $\pi^* = \{\mu_0^*, \mu_1^*, \ldots\}$ is an optimal controller,
i.e., $J_{\pi^*} = J^*$, then for every $x_0 \in X_f$, the state sequence
$\{x_k\}$ generated using $\pi^*$ and starting from $x_0$ remains within
$X_f$ and satisfies $J^*(x_k) \downarrow 0$. This can be seen by writing

$$J^*(x_0) = \sum_{t=0}^{k-1} g(x_t, \mu_t^*(x_t)) + J^*(x_k), \qquad k = 1, 2, \ldots,$$

and using the facts $g \ge 0$ and $J^*(x_0) < \infty$. Thus an optimal
controller, restricted to the subset $X_f$, may be viewed as a
Lyapunov-stable controller where the Lyapunov function is $J^*$.

On the other hand, existence of a “stable” controller does
not necessarily imply that $J^*$ is real-valued. In particular, it
may not be true that if the sequence $\{x_k\}$ generated by an
optimal controller starting from some $x_0$ converges to $X_s$,
then we have $J^*(x_0) < \infty$. The reason is that the cost per
stage $g$ may not decrease fast enough as we approach $X_s$. As
an example, let

$$X = \{0\} \cup \{ 1/m \mid m \text{ is a positive integer} \},$$

with $X_s = \{0\}$, and assume that there is a unique controller,
which moves from $1/m$ to $1/(m+1)$ with incurred cost $1/m$.
Then we have $J^*(x) = \infty$ for all $x \ne 0$, despite the fact
that the controller is “stable” in the sense that it generates a
sequence $\{x_k\}$ converging to 0 starting from every $x_0 \ne 0$.
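The divergence in this example is that of the harmonic series, and can be illustrated numerically. The function names below are hypothetical; the sketch simply accumulates the stage costs $1/m, 1/(m+1), \ldots$ along the unique trajectory.

```python
def accumulated_cost(m, num_steps):
    """Cost of num_steps moves starting at state 1/m: sum of 1/j for j = m, ..., m+num_steps-1."""
    return sum(1.0 / j for j in range(m, m + num_steps))


def state_after(m, num_steps):
    """State reached after num_steps moves starting at 1/m."""
    return 1.0 / (m + num_steps)


# The state approaches 0 while the accumulated cost keeps growing (roughly like log k).
for steps in (10, 1000, 100000):
    print(steps, state_after(1, steps), accumulated_cost(1, steps))
```

The partial sums grow without bound as the number of steps increases, even though the state converges to 0.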

III. Proofs of the Main Results 


Let us denote for all $x \in X$,

$$\Pi_{T,x} = \{ \pi \in \Pi \mid \pi \text{ terminates from } x \},$$

and note the following key implication of Assumption 1:

$$J^*(x) = \inf_{\pi \in \Pi_{T,x}} J_\pi(x), \qquad \forall\ x \in X_f. \tag{16}$$

In the subsequent arguments, the significance of policies
that terminate starting from some initial state $x_0$ is that the
corresponding generated sequences $\{x_k\}$ satisfy $J(x_k) = 0$
for all $J \in \mathcal{J}$ and $k$ sufficiently large.

Proof of Prop. 1: Let $J \in \mathcal{J}$ be a solution of the Bellman
equation (11), so that

$$J(x) \le g(x, u) + J(f(x, u)), \qquad \forall\ x \in X,\ u \in U(x), \tag{17}$$

while by Prop. 4(a), $J^* \le J$. For any $x_0 \in X_f$ and policy
$\pi = \{\mu_0, \mu_1, \ldots\} \in \Pi_{T,x_0}$, we have by using repeatedly Eq.
(17),

$$J^*(x_0) \le J(x_0) \le J(x_k) + \sum_{t=0}^{k-1} g(x_t, \mu_t(x_t)), \qquad k = 1, 2, \ldots,$$

where $\{x_k\}$ is the state sequence generated starting from $x_0$
and using $\pi$. Also, since $\pi \in \Pi_{T,x_0}$ and hence $x_k \in X_s$ and
$J(x_k) = 0$ for all sufficiently large $k$, we have

$$\limsup_{k \to \infty} \Big\{ J(x_k) + \sum_{t=0}^{k-1} g(x_t, \mu_t(x_t)) \Big\} = J_\pi(x_0).$$

By combining the last two relations, we obtain

$$J^*(x_0) \le J(x_0) \le J_\pi(x_0), \qquad \forall\ x_0 \in X_f,\ \pi \in \Pi_{T,x_0}.$$

Taking the infimum over $\pi \in \Pi_{T,x_0}$ and using Eq. (16), it
follows that $J^*(x_0) = J(x_0)$ for all $x_0 \in X_f$. Also for $x_0 \notin
X_f$, we have $J^*(x_0) = J(x_0) = \infty$ [since $J^* \le J$ by Prop.
4(a)], so we obtain $J^* = J$. □

Proof of Prop. 2: (a) Suppose that $J_0 \in \mathcal{J}$ and $J_0 \ge J^*$.
Starting with $J_0$, let us apply the VI operation to both sides
of the inequality $J_0 \ge J^*$. Since $J^*$ is a solution of Bellman’s
equation and VI has a monotonicity property that maintains
the direction of functional inequalities, we see that $J_1 \ge J^*$.
Continuing similarly, we obtain $J_k \ge J^*$ for all $k$. Moreover,
we clearly have $J_k(x) = 0$ for all $x \in X_s$, so $J_k \in \mathcal{J}$ for
all $k$. We now argue that since $J_k$ is produced by $k$ steps
of VI starting from $J_0$, it is the optimal cost function of the
$k$-stage version of the problem with terminal cost function
$J_0$. Therefore, we have for every $x_0 \in X$ and policy $\pi =
\{\mu_0, \mu_1, \ldots\}$,

$$J^*(x_0) \le J_k(x_0) \le J_0(x_k) + \sum_{t=0}^{k-1} g(x_t, \mu_t(x_t)), \qquad k = 1, 2, \ldots,$$

where $\{x_t\}$ is the state sequence generated starting from $x_0$
and using $\pi$. If $x_0 \in X_f$ and $\pi \in \Pi_{T,x_0}$, we have $x_k \in X_s$
and $J_0(x_k) = 0$ for all sufficiently large $k$, so that

$$\limsup_{k \to \infty} \Big\{ J_0(x_k) + \sum_{t=0}^{k-1} g(x_t, \mu_t(x_t)) \Big\} = J_\pi(x_0).$$

By combining the last two relations, we obtain

$$J^*(x_0) \le \liminf_{k \to \infty} J_k(x_0) \le \limsup_{k \to \infty} J_k(x_0) \le J_\pi(x_0),$$

for all $x_0 \in X_f$ and $\pi \in \Pi_{T,x_0}$. Taking the infimum over $\pi \in
\Pi_{T,x_0}$ and using Eq. (16), it follows that $\lim_{k \to \infty} J_k(x_0) =
J^*(x_0)$ for all $x_0 \in X_f$. Since for $x_0 \notin X_f$, we have $J^*(x_0) =
J_k(x_0) = \infty$, we obtain $J_k \to J^*$.

(b) Let $\{J_k\}$ be the VI sequence generated starting from some
function $J \in \mathcal{J}$. By the monotonicity of the VI operation, $\{J_k\}$
lies between the sequence of VI iterates starting from the zero
function [which converges to $J^*$ from below by Prop. 4(d)],
and the sequence of VI iterates starting from $J_0 = \max\{J, J^*\}$
[which converges to $J^*$ from above by part (a)]. □

Proof of Prop. 3: If $\mu$ is a stationary policy and $\bar\mu$ satisfies
the policy improvement equation

$$\bar\mu(x) \in \arg\min_{u \in U(x)} \big\{ g(x, u) + J_\mu(f(x, u)) \big\}, \qquad x \in X,$$

[cf. Eq. (7)], we have for all $x \in X$,

$$\begin{aligned} J_\mu(x) &= g(x, \mu(x)) + J_\mu\big(f(x, \mu(x))\big) \\ &\ge \min_{u \in U(x)} \big\{ g(x, u) + J_\mu(f(x, u)) \big\} \\ &= g(x, \bar\mu(x)) + J_\mu\big(f(x, \bar\mu(x))\big), \end{aligned} \tag{18}$$

where the first equality follows from Prop. 4(b) and the second
equality follows from the definition of $\bar\mu$. Let us fix $x$ and let
$\{x_k\}$ be the sequence generated starting from $x$ and using $\bar\mu$.
By repeatedly applying Eq. (18), we see that the sequence
$\{\tilde J_k(x)\}$ defined by

$$\tilde J_0(x) = J_\mu(x),$$

$$\tilde J_1(x) = J_\mu(x_1) + g(x, \bar\mu(x)),$$

and more generally,

$$\tilde J_k(x) = J_\mu(x_k) + \sum_{t=0}^{k-1} g(x_t, \bar\mu(x_t)), \qquad k = 1, 2, \ldots,$$

is monotonically nonincreasing. Thus, using also Eq. (18), we
have

$$J_\mu(x) \ge \min_{u \in U(x)} \big\{ g(x, u) + J_\mu(f(x, u)) \big\} = \tilde J_1(x) \ge \tilde J_k(x),$$

for all $x \in X$ and $k \ge 1$. This implies that

$$\begin{aligned} J_\mu(x) &\ge \min_{u \in U(x)} \big\{ g(x, u) + J_\mu(f(x, u)) \big\} \\ &\ge \lim_{k \to \infty} \tilde J_k(x) \\ &\ge \lim_{k \to \infty} \sum_{t=0}^{k-1} g(x_t, \bar\mu(x_t)) \\ &= J_{\bar\mu}(x), \end{aligned}$$

where the last inequality follows since $J_\mu \ge 0$. In conclusion,
we have

$$J_\mu(x) \ge \inf_{u \in U(x)} \big\{ g(x, u) + J_\mu(f(x, u)) \big\} \ge J_{\bar\mu}(x), \qquad x \in X.$$

Using $\mu^k$ and $\mu^{k+1}$ in place of $\mu$ and $\bar\mu$ in the preceding
relation, we obtain for all $x \in X$,

$$J_{\mu^k}(x) \ge \inf_{u \in U(x)} \big\{ g(x, u) + J_{\mu^k}(f(x, u)) \big\} \ge J_{\mu^{k+1}}(x). \tag{19}$$

Thus the sequence $\{J_{\mu^k}\}$ generated by PI converges monotonically
to some function $J_\infty \in E^+(X)$, i.e., $J_{\mu^k} \downarrow J_\infty$.
Moreover, by taking the limit as $k \to \infty$ in Eq. (19), we have
the two relations

$$J_\infty(x) \ge \inf_{u \in U(x)} \big\{ g(x, u) + J_\infty(f(x, u)) \big\}, \qquad x \in X,$$

and

$$g(x, u) + J_{\mu^k}(f(x, u)) \ge J_\infty(x), \qquad x \in X,\ u \in U(x).$$

We now take the limit in the second relation as $k \to \infty$, then
the infimum over $u \in U(x)$, and then combine with the first
relation, to obtain

$$J_\infty(x) = \inf_{u \in U(x)} \big\{ g(x, u) + J_\infty(f(x, u)) \big\}, \qquad x \in X.$$

Thus $J_\infty$ is a solution of Bellman’s equation, satisfying $J_\infty \in
\mathcal{J}$ (since $J_{\mu^k} \in \mathcal{J}$ and $J_{\mu^k} \downarrow J_\infty$), so by the uniqueness result
of Prop. 1 we have $J_\infty = J^*$. □

IV. Discussion, Special Cases, and Extensions 

In this section we elaborate on our main results and we derive 
easily verifiable conditions under which our assumptions hold. 

A. Conditions that Imply Assumption 1

Consider Assumption 1. As noted in Section I, it holds when
$X$ and $U$ are finite, a terminating policy exists from every
$x$, and all cycles of the state transition graph have positive
length. For the case where $X$ is infinite, let us assume that
$X$ is a normed space with norm denoted $\|\cdot\|$, and say that
$\pi$ asymptotically terminates from $x$ if the sequence $\{x_k\}$
generated starting from $x$ and using $\pi$ converges to $X_s$ in
the sense that

$$\lim_{k \to \infty} \mathrm{dist}(x_k, X_s) = 0,$$

where $\mathrm{dist}(x, X_s)$ denotes the minimum distance from $x$ to $X_s$,

$$\mathrm{dist}(x, X_s) = \inf_{y \in X_s} \|x - y\|, \qquad x \in X.$$

The following proposition provides readily verifiable conditions
that guarantee Assumption 1.

Proposition 5. Let the cost nonnegativity condition (2) and
stopping set conditions (8)-(9) hold, and assume further the
following:

(1) For every $x \in X_f$ and $\epsilon > 0$, there exists a policy $\pi$ that
asymptotically terminates from $x$ and satisfies

$$J_\pi(x) \le J^*(x) + \epsilon.$$

(2) For every $\epsilon > 0$, there exists a $\delta_\epsilon > 0$ such that for each
$x \in X_f$ with

$$\mathrm{dist}(x, X_s) \le \delta_\epsilon,$$

there is a policy $\pi$ that terminates from $x$ and satisfies
$J_\pi(x) \le \epsilon$.

Then Assumption 1 holds.

Proof: Fix $x \in X_f$ and $\epsilon > 0$. Let $\pi$ be a policy
that asymptotically terminates from $x$, and satisfies $J_\pi(x) \le
J^*(x) + \epsilon$, as per condition (1). Starting from $x$, this policy
will generate a sequence $\{x_k\}$ such that for some index $\bar k$ we
have

$$\mathrm{dist}(x_{\bar k}, X_s) \le \delta_\epsilon,$$

so by condition (2), there exists a policy $\bar\pi$ that terminates
from $x_{\bar k}$ and is such that $J_{\bar\pi}(x_{\bar k}) \le \epsilon$. Consider the policy $\pi'$
that follows $\pi$ up to index $\bar k$ and follows $\bar\pi$ afterwards. This
policy terminates from $x$ and satisfies

$$J_{\pi'}(x) = J_{\pi,\bar k}(x) + J_{\bar\pi}(x_{\bar k}) \le J_\pi(x) + J_{\bar\pi}(x_{\bar k}) \le J^*(x) + 2\epsilon,$$

where $J_{\pi,\bar k}(x)$ is the cost incurred by $\pi$ starting from $x$ up to
reaching $x_{\bar k}$. □

Condition (1) of the preceding proposition requires that
for states $x \in X_f$, the optimal cost $J^*(x)$ can be achieved
arbitrarily closely with policies that asymptotically terminate
from $x$. Problems for which condition (1) holds are those
involving a cost per stage that is strictly positive outside of
$X_s$. More precisely, condition (1) holds if for each $\delta > 0$ there
exists $\epsilon > 0$ such that

$$\inf_{u \in U(x)} g(x, u) \ge \epsilon, \qquad \forall\ x \in X \text{ such that } \mathrm{dist}(x, X_s) \ge \delta. \tag{20}$$

Then for any $x$ and policy $\pi$ that does not asymptotically
terminate from $x$, we will have $J_\pi(x) = \infty$, so that if
$x \in X_f$, all policies $\pi$ with $J_\pi(x) < \infty$ must be asymptotically
terminating from $x$. In applications, condition (1)
is natural and consistent with the aim of steering the state
towards the terminal set $X_s$ with finite cost. Condition (2)
is a “controllability” condition implying that the state can be
steered into $X_s$ with arbitrarily small cost from a starting state
that is sufficiently close to $X_s$.

Example 4 (Linear System Case). Consider a linear system

$$x_{k+1} = A x_k + B u_k,$$

where $A$ and $B$ are given matrices, with the terminal set being
the origin, i.e., $X_s = \{0\}$. We assume the following:

(a) $X = \Re^n$, $U = \Re^m$, and there is an open sphere $R$
centered at the origin such that $U(x)$ contains $R$ for all
$x \in X$.

(b) The system is controllable, i.e., one may drive the system
from any state to the origin within at most $n$ steps
using suitable controls, or equivalently that the matrix
$[B \ AB \ \cdots \ A^{n-1}B]$ has rank $n$.

(c) $g$ satisfies

$$0 \le g(x, u) \le \beta \big( \|x\|^p + \|u\|^p \big), \qquad \forall\ (x, u) \in V,$$

where $V$ is some open sphere centered at the origin,
$\beta, p$ are some positive scalars, and $\|\cdot\|$ is the standard
Euclidean norm.

Then condition (2) of Prop. 5 is satisfied, while $x = 0$ is cost-free
and absorbing [cf. Eq. (8)]. Still, however, in the absence
of additional assumptions, there may be multiple solutions to
Bellman’s equation within $\mathcal{J}$.

As an example, consider the scalar system $x_{k+1} = a x_k + u_k$ with $X = U(x) = \Re$, and the quadratic cost $g(x,u) = u^2$. Then Bellman’s equation has the form

$$J(x) = \min_{u \in \Re} \big\{ u^2 + J(ax + u) \big\}, \qquad x \in \Re,$$

and it is seen that the optimal cost function, $J^*(x) \equiv 0$, is a solution. Let us assume that $a > 1$ so the system is unstable (the instability of the system is important for the purpose of this example). Then it can be verified that the quadratic function $J(x) = (a^2 - 1)x^2$, which belongs to $\mathcal{J}$, also solves Bellman’s equation. This is a case where the algebraic Riccati equation associated with the problem has two nonnegative solutions because there is no cost on the state, and a standard observability condition for uniqueness of solution of the Riccati equation is violated.
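The two solutions can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the illustrative value `a = 2.0`, and the helper names `bellman_rhs`, `J_zero`, `J_quad`, and `max_residual` are hypothetical. It approximates the minimization over $u$ on a uniform grid and compares the result with $J(x)$ at a few sample states.

```python
# Illustrative check of the scalar example, assuming a = 2.0 (any a > 1 would do).
A_COEF = 2.0

def bellman_rhs(J, x, a=A_COEF, half_width=10.0, steps=20001):
    """Approximate min over u of u^2 + J(a*x + u) on a uniform grid of controls."""
    best = float("inf")
    for i in range(steps):
        u = -half_width + 2.0 * half_width * i / (steps - 1)
        best = min(best, u * u + J(a * x + u))
    return best

def J_zero(x):
    """The optimal cost function J*(x) = 0."""
    return 0.0

def J_quad(x, a=A_COEF):
    """The second candidate solution J(x) = (a^2 - 1) x^2."""
    return (a * a - 1.0) * x * x

def max_residual(J, points=(-2.0, -0.5, 0.0, 1.0, 3.0)):
    """Largest |J(x) - min_u {u^2 + J(a x + u)}| over a few sample states."""
    return max(abs(J(x) - bellman_rhs(J, x)) for x in points)

residual_zero = max_residual(J_zero)
residual_quad = max_residual(J_quad)
```

Both residuals are zero up to the grid resolution, consistent with the claim that both functions solve Bellman’s equation in this example.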

If on the other hand, in addition to (a)-(c), we assume that for some positive scalars $\gamma, p$, we have $\inf_{u \in U(x)} g(x,u) \ge \gamma \|x\|^p$ for all $x \in \Re^n$, then $J^*(x) > 0$ for all $x \ne 0$ [cf. Eq. (7)], while condition (1) of Prop. 5 is satisfied as well [cf. Eq. (20)]. Then by Prop. 5, Assumption 1 holds, and Bellman’s equation has a unique solution within $\mathcal{J}$.
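The effect of the added state cost can be illustrated on the scalar system. This sketch is not from the paper and makes its own assumptions: the cost $g(x,u) = q x^2 + u^2$ with a hypothetical weight `q`, the value `a = 2.0`, and quadratic iterates $J(x) = P x^2$. For such $J$, minimizing $q x^2 + u^2 + P(ax+u)^2$ over $u$ gives $\big(q + a^2 P/(1+P)\big)x^2$, so VI reduces to a scalar recursion on $P$.

```python
def vi_step(P, a, q):
    """One VI step on J(x) = P*x^2 for x_{k+1} = a*x_k + u_k, g = q*x^2 + u^2."""
    return q + a * a * P / (1.0 + P)

def vi_limit(P0, a, q, iters=200):
    """Iterate the scalar recursion from the initial coefficient P0."""
    P = P0
    for _ in range(iters):
        P = vi_step(P, a, q)
    return P

a = 2.0  # assumed unstable system, a > 1
starts = (0.0, 0.5, 10.0)

# No state cost (q = 0): fixed points are P = 0 and P = a^2 - 1.
no_state_cost = [vi_limit(P0, a, 0.0) for P0 in starts]

# State cost q = 1 (so inf_u g(x,u) >= |x|^2): a single nonnegative fixed point.
with_state_cost = [vi_limit(P0, a, 1.0) for P0 in starts]
```

With `q = 0` the limit depends on the starting coefficient, while with `q = 1` all three starts reach the same value, the positive root of $P^2 + (1 - q - a^2)P - q = 0$.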

There are straightforward extensions of the conditions of 
the preceding example to a nonlinear system. Note that even 
for a controllable system, it is possible that there exist states 
from which the terminal set cannot be reached, because U(x) 
may imply constraints on the magnitude of the control vector. 
Still the preceding analysis allows for this case. 
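Condition (b) of Example 4 is a rank test that is easy to run. The sketch below is an illustration under stated assumptions, not the author's procedure: matrices are given as nested Python lists, the helper names (`mat_mul`, `rank`, `controllability_matrix`, `is_controllable`) are hypothetical, and exact rational arithmetic is used to avoid tolerance choices. As noted above, the test says nothing about constraints imposed by $U(x)$.

```python
from fractions import Fraction

def mat_mul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def rank(M):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def controllability_matrix(A, B):
    """Build [B, AB, ..., A^(n-1)B] with the blocks placed side by side."""
    n = len(A)
    blocks, cur = [], B
    for _ in range(n):
        blocks.append(cur)
        cur = mat_mul(A, cur)
    return [sum((blk[i] for blk in blocks), []) for i in range(n)]

def is_controllable(A, B):
    """True if the controllability matrix has rank n."""
    return rank(controllability_matrix(A, B)) == len(A)

# Hypothetical test systems: a double integrator and a decoupled pair.
double_integrator = is_controllable([[1, 1], [0, 1]], [[0], [1]])
decoupled = is_controllable([[1, 0], [0, 2]], [[1], [0]])
```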


B. An Optimistic Form of PI 

Let us consider a variant of PI where policies are evaluated inexactly, with a finite number of VIs. In particular, this algorithm starts with some $J_0 \in E^+(X)$, and generates a sequence of cost function and policy pairs $\{J_k, \mu^k\}$ as follows: Given $J_k$, we generate $\mu^k$ according to

$$\mu^k(x) \in \arg\min_{u \in U(x)} \big\{ g(x,u) + J_k\big(f(x,u)\big) \big\}, \qquad x \in X, \tag{21}$$

and then we obtain $J_{k+1}$ with $m_k \ge 1$ VIs using $\mu^k$:

$$J_{k+1}(x_0) = J_k(x_{m_k}) + \sum_{t=0}^{m_k - 1} g\big(x_t, \mu^k(x_t)\big), \qquad x_0 \in X, \tag{22}$$

where $\{x_t\}$ is the sequence generated using $\mu^k$ and starting from $x_0$, and $m_k$ are arbitrary positive integers. Here $J_0$ is a function in $\mathcal{J}$ that is required to satisfy

$$J_0(x) \ge \inf_{u \in U(x)} \big\{ g(x,u) + J_0\big(f(x,u)\big) \big\}, \qquad \forall\, x \in X,\ u \in U(x). \tag{23}$$

For example $J_0$ may be equal to the cost function of some stationary policy, or be the function that takes the value 0 for $x \in X_s$ and $\infty$ at $x \notin X_s$. Note that when $m_k = 1$ the method is equivalent to VI, while the case $m_k = \infty$ corresponds to the standard PI considered earlier. In practice, the most effective value of $m_k$ may be found experimentally, with moderate values $m_k > 1$ usually working best. We refer to the textbooks [25] and [11] for discussions of this type of inexact PI algorithm (in [25] it is called “modified” PI, while in [11] it is called “optimistic” PI).
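The iteration (21)-(22) can be sketched on a small finite problem. The instance below is hypothetical and chosen only for illustration: five states, with state 0 playing the role of the terminal set and state 4 unable to reach it. The names `ARCS`, `improve`, `evaluate`, and `optimistic_pi` are not from the paper. The initial function takes the value 0 on the terminal set and $\infty$ elsewhere, which satisfies the condition (23).

```python
import math

# Hypothetical deterministic shortest-path instance: ARCS[x][u] = (next_state, cost).
# State 0 is the terminal set (cost-free and absorbing); state 4 cannot reach it.
ARCS = {
    0: {"stay": (0, 0.0)},
    1: {"to0": (0, 4.0), "to2": (2, 1.0)},
    2: {"to0": (0, 2.0), "to1": (1, 1.0)},
    3: {"to1": (1, 1.0), "to2": (2, 5.0)},
    4: {"loop": (4, 1.0)},
}

def improve(J):
    """Policy improvement, Eq. (21): pick a minimizing control at every state."""
    return {x: min(us, key=lambda u: us[u][1] + J[us[u][0]])
            for x, us in ARCS.items()}

def evaluate(J, mu, m):
    """Eq. (22): follow mu for m stages from each state, then add J at the end."""
    J_new = {}
    for x0 in ARCS:
        x, total = x0, 0.0
        for _ in range(m):
            nxt, cost = ARCS[x][mu[x]]
            total += cost
            x = nxt
        J_new[x0] = J[x] + total
    return J_new

def optimistic_pi(J0, m_schedule):
    """Run one improvement/evaluation pair per entry m_k of m_schedule."""
    history = [dict(J0)]
    J = dict(J0)
    for m in m_schedule:
        mu = improve(J)
        J = evaluate(J, mu, m)
        history.append(J)
    return history

J0 = {x: (0.0 if x == 0 else math.inf) for x in ARCS}
history = optimistic_pi(J0, m_schedule=[2, 1, 3, 2, 2, 2])
```

On this instance the iterates are monotonically nonincreasing and settle at the shortest-path costs, with the value $\infty$ retained at the state from which termination is impossible.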

Proposition 6 (Convergence of Optimistic PI). Let Assumption 1 hold. For the PI algorithm (21)-(22), where $J_0$ belongs to $\mathcal{J}$ and satisfies the condition (23), we have $J_k \downarrow J^*$.

Proof: We have for all $x \in X$,

$$\begin{aligned}
J_0(x) &\ge \inf_{u \in U(x)} \big\{ g(x,u) + J_0\big(f(x,u)\big) \big\} \\
&= g\big(x, \mu^0(x)\big) + J_0\big(f(x, \mu^0(x))\big) \\
&\ge J_1(x) \\
&\ge g\big(x, \mu^0(x)\big) + J_1\big(f(x, \mu^0(x))\big) \\
&\ge \inf_{u \in U(x)} \big\{ g(x,u) + J_1\big(f(x,u)\big) \big\} \\
&= g\big(x, \mu^1(x)\big) + J_1\big(f(x, \mu^1(x))\big) \\
&\ge J_2(x),
\end{aligned}$$

where the first inequality is the condition (23), the second and third inequalities follow because of the monotonicity of the $m_0$ value iterations (22) for $\mu^0$, and the fourth inequality follows from the policy improvement equation (21). Continuing similarly, we have

$$J_k(x) \ge \inf_{u \in U(x)} \big\{ g(x,u) + J_k\big(f(x,u)\big) \big\} \ge J_{k+1}(x),$$

for all $x \in X$ and $k$. Moreover, since $J_0 \in \mathcal{J}$, we have $J_k \in \mathcal{J}$ for all $k$. Thus $J_k \downarrow J_\infty$ for some $J_\infty \in \mathcal{J}$, and similar to the proof of Prop. 4 it follows that $J_\infty$ is a solution of Bellman’s equation. Hence, by the uniqueness result of Prop. 1 we have $J_\infty = J^*$. □


C. Minimax Control to a Terminal Set of States 

Our analysis can be readily extended to minimax problems 
with a terminal set of states. Here the system is 

$$x_{k+1} = f(x_k, u_k, w_k), \qquad k = 0, 1, \ldots,$$

where $w_k$ is the control of an antagonistic opponent that aims to maximize the cost function. We assume that $w_k$ is chosen from a given set $W$ to maximize the sum of costs per stage, which are assumed nonnegative:

$$0 \le g(x,u,w) \le \infty, \qquad x \in X,\ u \in U(x),\ w \in W.$$

We wish to choose a policy $\pi = \{\mu_0, \mu_1, \ldots\}$ to minimize the cost function

$$J_\pi(x_0) = \sup_{w_t \in W,\ t = 0, 1, \ldots}\ \lim_{k \to \infty} \sum_{t=0}^{k} g\big(x_t, \mu_t(x_t), w_t\big),$$

where $\{(x_k, \mu_k(x_k))\}$ is a state-control sequence corresponding to $\pi$ and the sequence $\{w_0, w_1, \ldots\}$. We assume that there is a termination set $X_s$, the states of which are cost-free and absorbing, i.e.,

$$g(x,u,w) = 0, \qquad x = f(x,u,w),$$

for all $x \in X_s$, $u \in U(x)$, $w \in W$, and that all states outside $X_s$ have strictly positive optimal cost, so that

$$X_s = \{ x \in X \mid J^*(x) = 0 \}.$$

The finite-state version of this problem has been discussed in [13], under the name robust shortest path planning, for the case where $g$ can take both positive and negative values. A problem that is closely related is reachability of a target set in minimum time, which is obtained for

$$g(x,u,w) = \begin{cases} 0 & \text{if } x \in X_s, \\ 1 & \text{if } x \notin X_s, \end{cases}$$

assuming also that the control process stops once the state enters the set $X_s$. Here $w$ is a disturbance described by set membership ($w \in W$), and the objective is to reach the set $X_s$ in the minimum guaranteed number of steps. The set $X_f$ is the set of states for which $X_s$ is guaranteed to be reached in a finite number of steps. Another related problem is reachability of a target tube, where for a given set $\hat{X}$,

$$g(x,u,w) = \begin{cases} 0 & \text{if } x \in \hat{X}, \\ \infty & \text{if } x \notin \hat{X}, \end{cases}$$

and the objective is to find the initial states starting from which we can guarantee to keep all future states within $\hat{X}$. These two reachability problems were first formulated and analyzed as part of the author’s Ph.D. thesis research [7], and the subsequent paper [8]. In fact the reachability algorithms given in these works are essentially special cases of the VI algorithm of the present paper, starting with appropriate initial functions $J_0$.

To extend our results to the general form of the minimax problem described above, we need to adapt the definition of termination. In particular, given a state $x$, in the minimax context we say that a policy $\pi$ terminates from $x$ if there exists an index $k$ [which depends on $(\pi, x)$] such that the sequence $\{x_k\}$, which is generated starting from $x$ and using $\pi$, satisfies $x_k \in X_s$ for all sequences $\{w_0, \ldots, w_{k-1}\}$ with $w_t \in W$ for all $t = 0, \ldots, k-1$. Then Assumption 1 is modified to reflect this new definition of termination, and our results can be readily extended, with Props. 1, 2, 4, and 6, and their proofs, holding essentially as stated. The main adjustment needed is to replace expressions of the forms

$$g(x,u) + J\big(f(x,u)\big)$$

and

$$J(x_k) + \sum_{t=0}^{k-1} g(x_t, u_t)$$

in these proofs with

$$\sup_{w \in W} \big\{ g(x,u,w) + J\big(f(x,u,w)\big) \big\}$$

and

$$\sup_{w_t \in W,\ t = 0, \ldots, k-1} \Big\{ J(x_k) + \sum_{t=0}^{k-1} g(x_t, u_t, w_t) \Big\},$$

respectively; see also [14] for a more abstract view of such lines of argument.
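The minimum-time reachability problem lends itself to a short VI sketch in which the one-stage expression is replaced by its supremum over $w$, as described above. Everything specific here is an assumption made for illustration: the five-state system, the control constraint at state 4, the disturbance set, the initial function (0 on the target set, $\infty$ elsewhere), and the names `controls`, `f`, `g`, `minimax_vi_step`, and `minimax_vi`.

```python
import math

STATES = range(5)        # hypothetical state space {0, ..., 4}
TERMINAL = {0}           # target set
DISTURBANCES = (0, 1)    # disturbance set W

def controls(x):
    """Hypothetical constraint set U(x): state 4 has only the weaker control."""
    return (1,) if x == 4 else (1, 2)

def f(x, u, w):
    """Move u steps toward 0 and get pushed back w steps; the target is absorbing."""
    if x in TERMINAL:
        return x
    return max(0, min(4, x - u + w))

def g(x, u, w):
    """Unit cost per stage outside the target set, zero inside it."""
    return 0.0 if x in TERMINAL else 1.0

def minimax_vi_step(J):
    """One VI step: minimize over u the worst case over w of g + J(f)."""
    return {
        x: min(max(g(x, u, w) + J[f(x, u, w)] for w in DISTURBANCES)
               for u in controls(x))
        for x in STATES
    }

def minimax_vi(J0, max_iters=100):
    """Iterate until the cost function stops changing (finite-state case)."""
    J = dict(J0)
    for _ in range(max_iters):
        J_next = minimax_vi_step(J)
        if J_next == J:
            break
        J = J_next
    return J

J0 = {x: (0.0 if x in TERMINAL else math.inf) for x in STATES}
J_star = minimax_vi(J0)
finite_cost_states = {x for x in STATES if J_star[x] < math.inf}
```

In this instance the limit gives the guaranteed number of steps to the target from states 1 to 3, while state 4 keeps the value $\infty$ because the disturbance can hold the state there indefinitely.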


V. Concluding Remarks 

In this paper we have considered problems of deterministic optimal control to a terminal set of states subject to very general assumptions. Under reasonably practical conditions, we have established the uniqueness of solution of Bellman’s equation, and the convergence of value and policy iteration algorithms, even when there are states with infinite optimal cost.
Our analysis bypasses the need for assumptions involving the 
existence of globally stabilizing controllers, which guarantee 
that the optimal cost function J* is real-valued. This generality 
makes our results a convenient starting point for analysis of 
problems involving additional assumptions, and perhaps cost 
function approximations. 

While we have restricted attention to undiscounted problems, the line of analysis of the present paper applies also to
discounted problems with one-stage cost function g that may 
be unbounded from above. Similar but more favorable results 
can be obtained, thanks to the presence of the discount factor; 
see the author’s paper [14], which contains related analysis 
for stochastic and minimax, discounted and undiscounted 
problems, with nonnegative cost per stage. 

The results for these problems, and the results of the present 
paper, have a common ancestry. They fundamentally draw 
their validity from notions of regularity, which were developed 
in the author’s abstract DP monograph [12] and were extended 
recently in [14]. Let us describe the regularity idea briefly, 
and its connection to the analysis of this paper. Given a set of functions $S \subset E^+(X)$, we say that a collection $C$ of policy-state pairs $(\pi, x_0)$, with $\pi \in \Pi$ and $x_0 \in X$, is $S$-regular if for all $(\pi, x_0) \in C$ and $J \in S$, we have

$$J_\pi(x_0) = \lim_{k \to \infty} \Big\{ J(x_k) + \sum_{t=0}^{k-1} g\big(x_t, \mu_t(x_t)\big) \Big\}.$$

In words, for all $(\pi, x_0) \in C$, $J_\pi(x_0)$ can be obtained in the limit by VI starting from any $J \in S$. The favorable properties with respect to VI of an $S$-regular collection $C$ can be translated into interesting properties relating to solutions of Bellman’s equation and convergence of VI. In particular, the optimal cost function over the set of policies $\{\pi \mid (\pi, x) \in C\}$,

$$J_C(x) = \inf_{\{\pi \,\mid\, (\pi, x) \in C\}} J_\pi(x), \qquad x \in X,$$

under appropriate problem-dependent assumptions, is the unique solution of Bellman’s equation within the set $\{J \in S \mid J \ge J_C\}$, and can be obtained by VI starting from any $J$ within that set (see [14]).

Within the deterministic optimal control context of this paper, it works well to choose $C$ to be the set of all $(\pi, x)$ such that $x \in X_f$ and $\pi$ is terminating starting from $x$, and to choose $S$ to be $\mathcal{J}$, as defined by Eq. (10). Then, in view of Assumption 1, we have $J_C = J^*$, and the favorable properties of $J_C$ are shared by $J^*$. For other types of problems different choices of $C$ may be appropriate, and corresponding results relating to the uniqueness of solutions of Bellman’s equation and the validity of value and policy iteration may be obtained; see [14].